ABSTRACT

We address the problem of cross-referencing text fragments
with Wikipedia pages, in a way that synonymy and polysemy
issues are resolved accurately and efficiently. We take
inspiration from a recent flow of work [3, 10, 12, 14], and
extend their scenario from the annotation of long documents
to the annotation of short texts, such as snippets of
search-engine results, tweets, news, blogs, etc. These short and
poorly composed texts pose new challenges in terms of
efficiency and effectiveness of the annotation process, which we
address by designing and engineering Tagme, the first
system that performs an accurate and on-the-fly annotation of
these short textual fragments. A large set of experiments
shows that Tagme outperforms state-of-the-art algorithms
when they are adapted to work on short texts, and that it is
fast and competitive on long texts.

1. INTRODUCTION 

The typical IR-approach to indexing, clustering, classifi- 
cation and retrieval, just to name a few, is that based on 
the bag-of-words paradigm. In recent years a good deal of 
work attempted to go beyond this paradigm with the goal of 
improving the search experience on (un-)structured or semi- 
structured textual data. In his invited talk at WSDM 2010, 
S. Chakrabarti surveyed this work categorizing it in three 
main classes: (a) adding structure to unstructured data, 
(b) adding structure to answers, and (c) adding structure 
to queries while avoiding the complexity of elaborate query 
languages that demand extensive schema knowledge. In this 
paper we will be concerned with the first issue consisting of 
the identification of sequences of terms in the input text and 
their annotation with un-ambiguous entities drawn from a 
catalog. The choice of the catalog is obviously crucial for the 
success of the approach. Several systems nowadays adopt
Wikipedia pages (or derived concepts) as entities, and
implement the annotation process by hyper-linking meaningful
sequences of terms with Wikipedia pages that are pertinent
to the topics dealt with by the input text. The choice of
Wikipedia is dictated by the fact that the number of its pages
is ever-expanding (> 3 million English pages, and > 500K
pages in each major European language) and it offers the
best trade-off between a catalog with a rigorous structure
but with low coverage (like the one offered by high-quality
entity catalogs such as WordNet, CYC, OpenCYC, TAP), and
a large text collection with wide coverage but unstructured
and noisy content (like the whole Web).

The first work that addressed the problem of annotating
texts with hyper-links to Wikipedia pages was Wikify
[12], followed by [3]. Recently, [10, 14] yielded considerable
improvements by proposing several new algorithmic
ingredients, the most notable of which are: (i) a measure of
relatedness among Wikipedia pages based on the overlap of their
in-links; and (ii) the modeling of the process as the search for
the annotations that maximize some global score which
depends on their coherence/relatedness and other statistics.
These ingredients allowed Milne&Witten [14] to achieve an
F-measure of 74.8% over long and focused input texts, and
Chakrabarti et al. [10] to improve the recall while still
obtaining comparable precision.

In this paper we add to this flow of work the specialty 
that the input texts to be annotated are very short, namely, 
they are composed of few tens of terms. The context of use 
we have in mind is the annotation of either the snippets of 
search-engine results, or the tweets of a Twitter channel, or 
the items of a news feed, or the posts of a blog, etc. As an
example, let us consider the following (recent) news: "Diego
Maradona won against Mexico". Our goal is to detect "Diego
Maradona" and "Mexico" as meaningful term sequences to
be annotated (hereafter called spots), and then hyper-link
them with the senses represented by the Wikipedia pages
which deal with the soccer player (now Argentina's coach)
Maradona and the football team of Mexico. The key
difficulty of this process is to detect on-the-fly those spots and
their pertinent senses among the (possibly many)
Wikipedia pages that are linked by those anchors in Wikipedia.
In fact "Diego Maradona" is the anchor of one Wikipedia
page, whereas "Mexico" is the anchor of 154 Wikipedia pages.
A good annotator should therefore be able to disambiguate
"Mexico" by linking this spot with the page dealing with the
football team, rather than the state. Furthermore, it should
prune the other spots "won" and "against", which are actually
anchors in Wikipedia but are obviously not meaningful in
the present news.

It is easy to argue that these poorly composed texts pose
new challenges in terms of efficiency and effectiveness of the
annotation process, which (1) should occur on-the-fly,
because in those contexts data may be retrieved at query time
and thus cannot be pre-processed, and (2) should be
designed properly, because the input texts are so short that
it is difficult to mine the significant statistics that are instead
available when texts are long. The systems of [10, 14] are
designed to deal with reasonably long texts, and indeed they
either depend on statistics that hinge on many well-focused
spots [14] or they compute sophisticated scoring functions
that make the whole process slow [10].

Given these limitations we have designed and implemented
Tagme, the first software system that, on-the-fly and with
high precision/recall, annotates short texts with pertinent
hyper-links to Wikipedia pages. Tagme uses as spots (to
be annotated) the sequences of terms composing the anchor
texts which occur in the Wikipedia pages, and it uses
as possible senses for each spot the (possibly many) pages
pointed to in Wikipedia by that spot/anchor. Tagme resolves
synonymy and polysemy issues among the potentially many
available mappings (spot-to-page) by finding a collective
agreement among them via new scoring functions which are
fast to compute and yield an accurate final annotation.
This paper will detail the algorithmic anatomy of
Tagme and present a large and varied set of experiments
that will validate the algorithmic choices made in the design
of Tagme, and will experimentally compare Tagme against
the two best-known systems [10, 14] both on short and long
texts. Sect. 3.1 will detail our achievements; here we
summarize that on short texts Tagme outperforms the
best-known systems either in accuracy or speed, or both. On long
texts, Tagme is competitive in accuracy and still very fast,
with a time complexity that grows linearly with the number
of processed anchors (cfr. [10]'s quadratic time complexity).

Tagme is available for test at http://tagme.di.unipi.it

2. NOTATION AND TERMINOLOGY 

A (text) anchor for a Wikipedia page p is a text used
in another Wikipedia page to point to p. In Wikipedia,
this can be the title of p, one of its synonyms or acronyms,
or it may even consist of a long phrase which might be
(much) different syntactically from p's title. As an example,
the anchor texts for the page "Nintendo DS" include the acronym
"nds" as well as the phrases "Gameboy ds" and "Nintendo Dual
Screen". For coverage purposes, we enrich the anchors of
p with the titles of the redirect pages that link to p. This
approach derives a total of about 8M distinct anchors from
Wikipedia (English). In this paper, following [10], we will
interchangeably use the terms anchor and spot.

Because of polysemy and variant names, the same anchor
a may occur in Wikipedia many times pointing to many
different pages. We denote this set of pages by Pg(a), and use
the notation freq(a) to denote the number of times a occurs
in Wikipedia (as an anchor or not), and link(a) to
indicate the number of times the text a occurs as an anchor
in Wikipedia (of course link(a) ≤ freq(a)). Also we use
lp(a) = link(a) / freq(a) to denote the link-probability that
an occurrence of a is an anchor pointing to some
Wikipedia page; and use Pr(p|a) to denote the prior-probability
that an occurrence of an anchor a points to a specific page
p ∈ Pg(a). The latter is also called the commonness of p.
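
As an illustration of the two statistics just defined, the following minimal sketch computes lp(a) and Pr(p|a) from raw counts. The function names and the counts for the anchor "jaguar" are hypothetical and serve only to show the arithmetic; they are not taken from the Tagme index.

```python
from collections import Counter


def link_probability(link_count: int, freq_count: int) -> float:
    """lp(a) = link(a) / freq(a); returns 0.0 if the text never occurs."""
    if freq_count == 0:
        return 0.0
    return link_count / freq_count


def commonness(targets: Counter) -> dict:
    """Pr(p|a) for each page p in Pg(a), estimated from anchor counts.

    `targets` maps each page pointed to by anchor a to the number of
    times a links to that page.
    """
    total = sum(targets.values())
    return {page: n / total for page, n in targets.items()}


# Hypothetical counts: "jaguar" occurs 400 times, 100 of them as an anchor.
jaguar_targets = Counter({"Jaguar Cars": 60, "Jaguar (animal)": 40})
lp_jaguar = link_probability(link_count=100, freq_count=400)
pr_jaguar = commonness(jaguar_targets)
```

Here link(a) is the sum of the per-page counts, so the commonness values of an anchor always sum to 1.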

^A preliminary (and short) description of Tagme has been
published as a poster in CIKM 2010. The present paper
contains an engineered version of Tagme with a larger set
of experiments, comparisons and findings.



The annotation of an anchor a with some page p ∈ Pg(a) is
denoted by a ↦ p. Often a has more than one sense, thus
|Pg(a)| > 1, so we call disambiguation the process of selecting
one of the possible senses of a from Pg(a). It goes without saying
that not all occurrences of the anchor a should be considered
as meaningful and thus be annotated. So we follow [10]
and introduce a fake page NA that is used to prune the
un-meaningful annotations, via the dummy mapping a ↦ NA.

3. RELATED WORKS 

The literature offers two main approaches to enrich a (pos- 
sibly short) text with additional structure and information 
that may empower subsequent IR-steps such as clustering, 
classification, or mining. 

One approach consists of extending the classic term-based
vector-space model with additional dimensions
corresponding to features (concepts) extracted from an external
knowledge base, such as DMOZ [4, 6], Wikipedia [7, 19], or even
the whole Web (such as Google's kernel [15]). Probably
the best achievements have been obtained by querying
Wikipedia (titles or entire pages) by means of short phrases
(possibly single terms) extracted from the input text to be
contextualized. The result pages (typically restricted to the
top-k) and their scores (typically tf-idf) are used to build
a vector that is considered the "semantic representation" of
the phrase and is finally used in classification, clustering
[1, 9], or searching [19] processes. The advantage of this
approach is that it extends the bag-of-words scheme with more
concepts, thus possibly allowing the identification of related
texts which are syntactically far apart. The drawback lies in
the contamination of these vectors by un-related (but common)
concepts retrieved via the syntactic queries.

In order to overcome these limitations, some authors have
tried to annotate only the salient text fragments present in
an input text, without resorting to the vector-space model.
Their key idea is to identify in the input text
short-and-meaningful sequences of terms and connect them to
unambiguous senses drawn from a catalog. The catalog can be
formed by either a small set of specifically recognized types,
most often People and Locations (aka Named Entities), or it
can consist of millions of senses drawn from a large
knowledge base, such as Wikipedia. In the former case (see e.g.
[17, 20]), substantial training and/or human effort is
required to eventually produce a "coarse" annotation: these
systems would probably recognize that a sequence of terms,
say Michael Jordan, is the name of a person, but they would
fail to disambiguate which person the occurrence is
referring to (in fact, Wikipedia contains 7 persons with the name
Michael Jordan). In the latter case, the annotation can take
advantage of millions of senses (currently more than 3 million
English pages, and more than 500K pages in each major
European language) and several million relations among them.
This catalog is ever-expanding and currently offers the best
trade-off between a catalog with a rigorous structure but
with low coverage (like the one offered by high-quality
entity catalogs such as WordNet, CYC, TAP), and a large text
collection with wide coverage but unstructured and noisy
content (like the whole Web).

To our knowledge the first work that deployed this huge
knowledge base of senses and relations to efficiently and
accurately cross-reference long documents was Wikify [12],
soon followed by [3]. Recently, Milne&Witten [14] proposed
an approach that yielded considerable improvements by
hinging on three main ingredients: (i) the identification in the
input text of a set C of so-called context pages, namely
Wikipedia pages pointed to by anchors that are not ambiguous
(because they link to just one page/sense); (ii) a measure
rel(p_1, p_2) of relatedness between two pages p_1, p_2 based on
the overlap between their in-linking pages in Wikipedia; and
finally (iii) a notion of coherence between a page p and the
other context pages in C. Given these, the disambiguation
of an anchor a was obtained by using a classifier that
exploited for each sense p ∈ Pg(a): the commonness Pr(p|a)
of the annotation a ↦ p, the relatedness rel(p, c) between
the candidate sense p and all other context pages c ∈ C
(which are un-ambiguous), and the coherence of each c with
respect to the entire input text. Then anchor pruning was
performed by using another classifier that mainly exploited
the location and the frequency of a in the input text, its link
probability, the relatedness between p and the un-ambiguous
pages of C, and the confidence of disambiguation assigned
by the classifier to a ↦ p (at the previous step). In [14] the
authors showed an F-measure of 74.8%, but this holds for
"reasonably long and focused" texts^, so the approach seems
unsuitable for our scenario, in which we need to process short
(and thus potentially ambiguous) input texts.

Last year, Chakrabarti and his group [10] proposed an
annotator based on two other novelties. The first one was to
evaluate an annotation a ↦ p with two scores: one local to
the occurrence of a (and involving 12 features) and the other
global to the entire input text (and involving all the other
annotations and a relatedness function inspired by [13]). The
second novelty was to model the entire annotation process
as a search for the mapping that maximizes some global
score, via the solution of a (sophisticated) quadratic
assignment problem. Extensive experiments showed that [10]'s
approach yields precision comparable to Milne&Witten's
system but with a considerably higher recall. Unfortunately,
the system is slow since it takes > 2 seconds to annotate a
text of about 15 anchors (see Figure 13 of [10]); this is due to
the sophisticated annotation process (recall the
quadratic-assignment problem above) and the many term-vector
comparisons and computations. This annotation speed is
acceptable for an off-line setting, like the one considered in
[10], but it is unsuitable for our setting where we wish to
annotate on-the-fly many short texts (possibly coming from
the results of a search engine, or a tweet channel).

In summary, the systems of [10, 14] seem unsuitable to
annotate on-the-fly short and poorly composed texts, given
that they either depend on statistics that hinge on many
well-focused spots [14] or they compute sophisticated
scoring functions that make the whole process slow [10]. The
numerous experiments in Sect. 5 will sustain this intuition,
and will validate the use of Tagme not only for the
annotation of short texts but also for the long ones.

3.1 Our Results 

The first goal of this paper is to describe the algorithmic
anatomy of Tagme, the first software system that
annotates short text fragments on-the-fly and with high
precision/recall by cross-referencing meaningful text spots (i.e.
anchors drawn from Wikipedia) detected in the input text
with one pertinent sense (i.e. Wikipedia page) for each of
them. This annotation is obtained via two main phases,
which are called anchor disambiguation and anchor pruning.
Disambiguation will be based on finding the "best agreement"
among the senses assigned to the anchors detected in the
input text. Pruning will aim at discarding the anchor-to-sense
annotations which are not pertinent to the topics the
input text talks about. So the structure of Tagme
mimics that of the systems of [10, 14] but introduces some new
scoring functions which improve the speed and accuracy of
the disambiguation and pruning phases. The algorithmic
contribution of Tagme's design therefore consists of:

^This is the response message of the system, available at
http://wikipedia-miner.sourceforge.net, when the
input text is too short.

• a new voting scheme for anchor disambiguation that
builds upon the relatedness function proposed in [13]
to find the collective agreement among all anchor-sense
matches detected in the (short) input text. The
specialty of this voting is that it is simple and thus fast
to compute (cfr. [10]'s quadratic assignment
problem), and it judiciously combines the relatedness among
all candidate annotated-senses (cfr. [14]'s un-ambiguous
pages only) in order to account for the sparseness of
the anchors which is typical of the annotation of short
texts.

• the design and test of several pruning schemes which
build upon two simple features extracted from each
candidate annotation: the link probability of its
anchor and the coherence between that annotation and
all other candidate annotations detected in the input
text. Although these features have already been used
by [10, 14], Tagme will combine them in many new
ways. The final result will be a large spectrum of
possible pruning approaches, from which we will choose
Tagme's final pruner, which will consistently improve
on known systems while remaining sufficiently simple and
thus fast to compute.

The second goal of this paper is the execution of a large
and varied set of experiments on publicly available
datasets [10] as well as on new large datasets that we have
created and made available to the community.^ These
experiments will aim at (i) testing the novel disambiguation
and pruning algorithms described above, in order to set
the best choice for Tagme, and (ii) comparing Tagme against
the two best-known systems (namely, Chakrabarti's and
Milne&Witten's) both on short and long texts, in order
to derive principled conclusions about speed and accuracy
issues of these annotators. In this respect, the contribution
of our paper will be the following:

• On short texts we will show that Tagme outperforms
Milne&Witten's system by yielding an F-measure of
about 78% (versus their 69%), with the possibility to
balance precision (up to 90%) vs recall (up to 80%), at
similar annotation speed. The system of Chakrabarti
et al. has not been tested because it is unavailable;^ anyway,
as commented in Sect. 3, it could not be used in our
context because it is very slow, since it takes > 2 secs
per 15 anchors [10]. This is more than one order of
magnitude slower than Tagme, which takes less than
2ms per anchor (see Sect. 5.5 for details).

• On long texts, Tagme is competitive with the two
above systems in terms of accuracy, with the
advantage of offering a faster speed (still less than 2ms per
anchor). This is due to its algorithmic structure that
guarantees a time complexity linear in the number
of processed anchors (cfr. [10]'s quadratic time
complexity), and an efficient internal-memory utilization
bounded by 200Mb (cfr. [14]'s software that uses more
than 1.5Gb).

^http://acube.di.unipi.it/datasets

^S. Chakrabarti's personal communication.



4. THE ANATOMY OF TAGME 

Tagme indexes some distilled information drawn from the 
Wikipedia snapshot of November 6, 2009. 

Anchor dictionary. We took all anchors present in the
Wikipedia pages, and augmented them with the titles of
redirect pages plus some variants of the page titles, as
suggested in [3]. We then removed the anchors composed of one
character or just numbers, and also discarded all anchors a
whose absolute frequency (link(a) < 2) or relative
frequency (lp(a) < 0.1%) was small enough that we could argue
that a is unsuitable for annotation and probably misleading
for disambiguation. The final dictionary contains about 3M
anchors, and it is indexed by Lucene.^
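
The filtering rules for the anchor dictionary can be sketched as a single predicate. This is only an illustration under stated assumptions: the function name is hypothetical, "just numbers" is interpreted here as "only digits once spaces are removed", and the two thresholds are the ones quoted in the text.

```python
def keep_anchor(text: str, link_count: int, freq_count: int,
                min_link: int = 2, min_lp: float = 0.001) -> bool:
    """Return True if the anchor survives the dictionary filters.

    Drops anchors made of one character or of digits only, anchors with
    link(a) < 2, and anchors with lp(a) < 0.1%.
    """
    stripped = text.strip()
    if len(stripped) <= 1:
        return False
    # Assumption: "just numbers" means digits only, ignoring spaces.
    if stripped.replace(" ", "").isdigit():
        return False
    if link_count < min_link:
        return False
    lp = link_count / freq_count if freq_count else 0.0
    return lp >= min_lp
```

For instance, with made-up counts, `keep_anchor("the", 5, 100000)` is rejected because its link probability is 0.005%, while `keep_anchor("jaguar cars", 50, 100)` is kept.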

Page catalog. We took all Wikipedia pages and discarded
disambiguation pages, list pages, and redirect pages, because
they are unsuitable as senses for anchor annotation. The
remaining 2.7M pages were indexed by Lucene.

In-link graph. This is a directed graph whose vertices
are the pages in the Page Catalog, and whose edges are
the links among these pages derived from the
Wikipedia-dump called "Page-to-page link records". This graph
contains about 147M edges, and is indexed in internal-memory
by WebGraph.^

Tagme uses these data structures to annotate a short
text via three main steps: (anchor) parsing, disambiguation
and pruning. Parsing detects the anchors in the input text
by searching for multi-word sequences in the Anchor
Dictionary; Disambiguation judiciously cross-references each of
these anchors with one pertinent sense drawn from the Page
Catalog; Pruning possibly discards some of these annotations
if they are considered not meaningful for contextualizing the
input text. Everything is designed to occur on-the-fly and
achieve high precision/recall. Details follow.

4.1 Anchor parsing 

Tagme receives a short text in input, tokenizes it, and
then detects the anchors by querying the Anchor dictionary
for sequences of up to 6 words. Since anchors may overlap or
be substrings of one another, we need to detect their
boundaries. We simplified the approach of [3] in the following
way: if we have two anchors a_1, a_2 such that a_1 is a (word-based)
substring of a_2, we drop a_1 only if lp(a_1) < lp(a_2). This
is because a_1 is typically more ambiguous than a_2 (being
one of its substrings), and editors like to link more specific
(longer) word sequences. Therefore, we prefer to discard a_1
in order to ease the subsequent disambiguation task. As an
example, consider a_1 = "jaguar" and a_2 = "jaguar cars": in
this case, if we did not discard a_1, the disambiguation task would
uselessly handle all possible senses of "jaguar", thus slowing
down the process and making it more cumbersome.

On the other hand, it might be the case that lp(a_1) >
lp(a_2). Given that freq(a_1) ≥ freq(a_2), this may occur
only if link(a_1) ≫ link(a_2). This is the case when a_2 adds a
non-meaningful word to a_1 that nonetheless identifies some
senses. As an example, consider a_1 = "act" and a_2 = "the
act", for which lp(a_1) > lp(a_2): in fact, "act" refers to
a huge number of possible senses (Act of parliament,
Australian Capital Territory, Act of a drama, Group Action,
etc.), while "the act" is the name of a band and the title
of a musical, with a consequently small number of link
occurrences. In this case we keep both anchors because, at this
initial step of the annotation process, we are not able to
make a principled pruning.
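
The boundary rule of this section can be sketched as follows. This is a minimal illustration, not the Tagme implementation: the function names are hypothetical, anchors are assumed to be already detected and given as a map from anchor text to lp(a), and the lp values in the usage lines are made up.

```python
def is_word_substring(short: str, long: str) -> bool:
    """True if the words of `short` occur contiguously inside `long`."""
    s_words, l_words = short.split(), long.split()
    if len(s_words) >= len(l_words):
        return False
    span = len(s_words)
    return any(l_words[i:i + span] == s_words
               for i in range(len(l_words) - span + 1))


def resolve_overlaps(anchors: dict) -> set:
    """Drop a_1 when it is a word-based substring of some a_2 with
    lp(a_1) < lp(a_2); keep both anchors otherwise."""
    kept = set()
    for a1, lp1 in anchors.items():
        dominated = any(is_word_substring(a1, a2) and lp1 < lp2
                        for a2, lp2 in anchors.items())
        if not dominated:
            kept.add(a1)
    return kept


# Hypothetical link probabilities for the two examples in the text.
cars = resolve_overlaps({"jaguar": 0.1, "jaguar cars": 0.5})
act = resolve_overlaps({"act": 0.3, "the act": 0.05})
```

With these values, `cars` retains only "jaguar cars", whereas `act` retains both "act" and "the act".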

4.2 Anchor disambiguation 

This phase takes inspiration from [10, 13, 14], but extends
their approaches to work accurately and on-the-fly over short
texts. As in [10], we aim for the collective agreement among
all senses associated to the anchors detected in the input
text and, as in [14], we take advantage of the un-ambiguous
anchors (if any) to boost the selection of these senses for
the ambiguous anchors. However, unlike these approaches,
we propose new disambiguation scores that are much
simpler, and thus faster to compute, and take into account
the sparseness of the anchors and the possible lack of
un-ambiguous anchors in short texts.

More precisely, given a set of anchors A_T detected in
the short input fragment T, Tagme tries to disambiguate
each anchor a ∈ A_T by computing a score for each possible
sense p_a of a (hence p_a ∈ Pg(a)). This score is based on a
new notion of "collective agreement" between the sense p_a
(Wikipedia page) and the senses (pages) of all other anchors
detected in T. The (agreement) score of a ↦ p_a is evaluated
by means of a voting scheme that computes for each other
anchor b ∈ A_T \ {a} its vote to that annotation. Given that
b may have many senses (i.e. |Pg(b)| > 1) we compute this
vote as the average relatedness between each sense p_b of b
and the sense p_a we wish to associate to a. The relatedness
between the two Wikipedia pages p_a and p_b is computed as
suggested in [13] as:



$$\mathrm{rel}(p_a, p_b) = \frac{\log(\max(|in(p_a)|, |in(p_b)|)) - \log(|in(p_a) \cap in(p_b)|)}{\log(W) - \log(\min(|in(p_a)|, |in(p_b)|))}$$

where in(p) is the set of Wikipedia pages pointing to page
p and W is the number of pages in Wikipedia. Hence the
vote given by anchor b to the annotation a ↦ p_a is:

$$\mathrm{vote}_b(p_a) = \frac{\sum_{p_b \in Pg(b)} \mathrm{rel}(p_b, p_a) \cdot \Pr(p_b|b)}{|Pg(b)|}$$

^http://lucene.apache.org

^http://webgraph.dsi.unimi.it



We notice that the average is computed by weighting each
relatedness rel(p_a, p_b) with the commonness of the sense p_b
(i.e. Pr(p_b|b)), because we argue that not all possible senses
of b have the same (statistical) significance. So if b is
un-ambiguous, then Pr(p_b|b) = 1 and |Pg(b)| = 1, and thus we
have vote_b(p_a) = rel(p_b, p_a) and hence we fully deploy the
unique senses of the un-ambiguous anchors (as occurred
in [14]). But if b is polysemous, only the senses p_b related
to p_a will mainly affect vote_b(p_a) because of the use of the
relatedness score rel(p_b, p_a).
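
The vote of a single anchor b can be sketched as below. The sketch treats the relatedness function as a black box passed in by the caller, so it makes no assumption on how rel is computed; the function name, the page titles and all numeric values are hypothetical.

```python
from typing import Callable, Dict


def vote(pg_b: Dict[str, float], p_a: str,
         rel: Callable[[str, str], float]) -> float:
    """vote_b(p_a): commonness-weighted relatedness between p_a and the
    senses of b, divided by |Pg(b)|.

    `pg_b` maps each sense p_b of anchor b to its commonness Pr(p_b|b).
    """
    if not pg_b:
        return 0.0
    weighted = sum(rel(p_b, p_a) * pr for p_b, pr in pg_b.items())
    return weighted / len(pg_b)


# Toy symmetric relatedness table (values are made up).
REL = {frozenset(("X", "Y")): 0.8}


def toy_rel(p: str, q: str) -> float:
    return REL.get(frozenset((p, q)), 0.0)


unambiguous = vote({"X": 1.0}, "Y", toy_rel)             # equals rel(X, Y)
ambiguous = vote({"X": 0.75, "Z": 0.25}, "Y", toy_rel)   # only X contributes
```

The first call reproduces the un-ambiguous case discussed above, where the vote collapses to the plain relatedness; in the second call the unrelated sense Z adds nothing to the numerator.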



Finally, the total score for the annotation a ↦ p_a is
computed as the sum of the votes given by all other anchors b
detected in T:

$$\mathrm{rel}_a(p_a) = \sum_{b \in A_T \setminus \{a\}} \mathrm{vote}_b(p_a)$$

This score is not enough to obtain an accurate
disambiguation, so we combine it with the commonness of the sense p_a
for a (i.e. Pr(p_a|a)), used as the "statistical support" for the
significance of this annotation. There are of course many
possible ways to combine these two values. In this paper
we investigate two approaches: Disambiguation by
Classifier (DC for short) and Disambiguation by Threshold (DT
for short). DC uses a classifier that takes the above two scores as
features and computes a value that can be interpreted as
the "probability of correct disambiguation" for the mapping
a ↦ p_a. Then it annotates a with the sense p_a ∈ Pg(a) that
reports the highest classification score.

On the other hand, DT avoids the use of classifiers and
recognizes a roughness in the value of the voting-score rel_a(p_a)
among all p_a ∈ Pg(a). So it first determines the sense p_best
that achieves the highest relatedness rel_a(p_best) with the
anchor a, and then identifies the set of other senses in Pg(a)
that yield about the same value of rel_a(p_best), according
to some fixed threshold ε. Finally DT annotates a with
the sense p_a that obtains the highest commonness Pr(p_a|a)
among these top-ε senses.
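
A minimal sketch of Disambiguation by Threshold follows. It is not the authors' code: the text does not specify how "about the same value" is measured, so the sketch assumes a relative margin (a sense is kept if its score is at least (1 - ε) times the best score). Function names, the relatedness table and all numbers are hypothetical.

```python
def total_score(a, p_a, senses, rel):
    """rel_a(p_a): sum over the other anchors b of vote_b(p_a).

    `senses` maps each anchor to a dict {sense: commonness}.
    """
    score = 0.0
    for b, pg_b in senses.items():
        if b == a or not pg_b:
            continue
        score += sum(rel(p_b, p_a) * pr for p_b, pr in pg_b.items()) / len(pg_b)
    return score


def disambiguate_dt(a, senses, rel, eps=0.3):
    """Pick the most common sense among those scoring close to the best."""
    scores = {p: total_score(a, p, senses, rel) for p in senses[a]}
    best = max(scores.values())
    # Assumption: "about the same value" means within a relative margin eps.
    top = [p for p, s in scores.items() if s >= best * (1 - eps)]
    return max(top, key=lambda p: senses[a][p])


# Toy data for "Diego Maradona won against Mexico" (values are made up).
TEAM = "Mexico national football team"
REL = {frozenset(("Diego Maradona", TEAM)): 0.7,
       frozenset(("Diego Maradona", "Mexico")): 0.2}


def toy_rel(p, q):
    return REL.get(frozenset((p, q)), 0.0)


senses = {"mexico": {"Mexico": 0.9, TEAM: 0.1},
          "diego maradona": {"Diego Maradona": 1.0}}
strict = disambiguate_dt("mexico", senses, toy_rel, eps=0.3)
loose = disambiguate_dt("mexico", senses, toy_rel, eps=0.8)
```

With the tight margin only the football team survives the threshold and is selected; with the loose margin both senses survive and commonness picks the country, which illustrates how ε trades collective agreement against prior probability.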

Given that speed is a main concern, both DC and DT
discard from the above computation all senses whose
commonness is lower than a properly set threshold τ. In fact, as
illustrated in [14], the distribution of Pr(p|a) follows a power
law, so we can safely discard pages at the tail of that
distribution. The setting of τ clearly affects the precision of
the disambiguation process: if τ is too large, precision
decreases because we would discard many pertinent senses; if
τ is too small, speed and recall decrease. In Sect. 5.2.1 we
will perform a wide set of experiments to evaluate these two
algorithms and their parameter settings.
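
The commonness cut can be sketched in a few lines. The function name and the default value of the threshold are hypothetical; the text does not state the value actually used at this point.

```python
def drop_rare_senses(pg_a: dict, tau: float = 0.02) -> dict:
    """Keep only the senses whose commonness Pr(p|a) is at least tau.

    `pg_a` maps each sense of anchor a to its commonness; `tau` is a
    placeholder value chosen for illustration only.
    """
    return {p: pr for p, pr in pg_a.items() if pr >= tau}


# Made-up commonness values with a long tail of rare senses.
pg_act = {"Act of Parliament": 0.55, "Australian Capital Territory": 0.30,
          "Act (drama)": 0.14, "Rare sense": 0.01}
kept = drop_rare_senses(pg_act)
```

Since every surviving sense costs one relatedness evaluation per sense of every other anchor, trimming the tail directly reduces the work done in the voting step.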

We conclude this section by pointing out the key
differences between Tagme's disambiguation scores and those
proposed by Milne&Witten and Chakrabarti et al. (see Sect. 3).
As for the former, disambiguation was based only on
un-ambiguous anchors, which are possibly missing in short
input texts. The consequence on the performance of Tagme's
disambiguation, as reported in the following Table 3, is a
significant improvement in Recall (+6.5% absolute) and a
slight decrement in Precision (-0.8% absolute), which give
an absolute improvement in the F-measure of Tagme versus
the one of [14] on short texts of about +3%. As for [10], we
recall that Chakrabarti et al. used vectors over terms and
over all involved senses (pages); conversely, Tagme
implicitly uses few short vectors, one vector per detected anchor
and one dimension per detected sense (actually restricted to
the un-discarded ones). This clearly induces a significantly
better speed (more than one order of magnitude) and similar
accuracy, as commented in the following sections.

As a final comment, we note that previous works [3, 10,
12] also deployed the text surrounding anchors for boosting
the efficacy of the disambiguation process. We tested these
features in the design of Tagme but we got either worse
accuracy or slower speed of annotation. We therefore dropped
them from the final design of Tagme, and do not report
these numbers for lack of space. Nonetheless we believe
that they could turn out to be useful in the applications of
Tagme discussed in Sect. 6, so we plan to dig into them in
the near future.

4.3 Anchor pruning 

The disambiguation phase produces a set of candidate
annotations, one per anchor detected in the input text T. This
set has to be pruned in order to possibly discard the
un-meaningful annotations. These "bad annotations" are
detected via a simple, yet effective, scoring function that takes
into account only two features: the link probability lp(a) of
the anchor a and the coherence between its candidate
annotation a ↦ p_a (assigned by the Disambiguation Phase) and
the candidate annotations of the other anchors in T. The
effectiveness of the link probability in detecting significant
anchors has been shown in [14]. The usefulness of the
coherence was also shown in [14], but limited to the case of
un-ambiguous anchors. Tagme extends this notion to all
anchors present in T by introducing a novel formula that
is based on the average relatedness between the candidate
sense p_a and the candidate senses p_b assigned to all other
anchors b. More precisely, if S is the set of distinct senses
assigned to the anchors of T after the Disambiguation Phase
(say |S| > 1), we compute:

$$\mathrm{coherence}(a \mapsto p_a) = \frac{1}{|S| - 1} \sum_{p_b \in S \setminus \{p_a\}} \mathrm{rel}(p_b, p_a)$$

The goal of the pruning phase is to keep all anchors whose
link probability is high or whose assigned sense (page) is
coherent with the senses (pages) assigned to the other anchors.
We investigated five different implementations of this idea:
two are based on a proper arithmetic combination of the
values lp and coherence, whereas the other three
implementations deploy the classifiers C4.5, Bagged C4.5 and
Support Vector Machine. Each pruner computes for each
candidate annotation a pruning score, say ρ(a ↦ p), and then
compares it against a properly set threshold ρ_NA, so that if
ρ(a ↦ p) < ρ_NA then that annotation for a is discarded by
setting a ↦ NA. The parameter ρ_NA makes it possible to balance
recall vs precision, and its impact will be experimentally
evaluated in Sect. 5.2.3.

The details of the five pruners follow. The first two are very
simple: one computes the average of lp and coherence as
ρ_AVG(a ↦ p_a) = (lp(a) + coherence(a ↦ p_a))/2; the other
computes a linear combination ρ_LR(a ↦ p_a) = α · lp(a) +
β · coherence(a ↦ p_a) + γ in which the 3 parameters are
trained via linear regression. Conversely, the three
classifier-based pruners take lp and coherence
as input features and return a value (confidence) that can be
interpreted as the "probability of not-pruning" the evaluated
annotation a ↦ p.

Sect. 5.2.2 will evaluate the performance of these pruners
and will show that, although very simple in using just two
features, they are all effective. Our final choice will be in
favor of ρ_AVG because of its simplicity (hence, speed) and its
avoidance of any training step.
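
The averaging pruner can be sketched as follows, with coherence computed as the average relatedness between a candidate sense and the other distinct senses assigned in the text. This is an illustration under assumptions: function names are hypothetical, the relatedness function is supplied by the caller, and the link probabilities, relatedness values, the sense chosen for "won" and the threshold are all made up.

```python
def coherence(p_a, assigned, rel):
    """Average relatedness between p_a and the other distinct senses."""
    others = set(assigned.values()) - {p_a}
    if not others:
        return 0.0
    return sum(rel(p_b, p_a) for p_b in others) / len(others)


def prune_avg(assigned, lp, rel, rho_na=0.2):
    """Map each anchor to its sense, or to "NA" when the average of lp and
    coherence falls below the threshold rho_na (a placeholder value)."""
    result = {}
    for a, p_a in assigned.items():
        score = (lp[a] + coherence(p_a, assigned, rel)) / 2
        result[a] = p_a if score >= rho_na else "NA"
    return result


TEAM = "Mexico national football team"
REL = {frozenset(("Diego Maradona", TEAM)): 0.7}


def toy_rel(p, q):
    return REL.get(frozenset((p, q)), 0.0)


assigned = {"diego maradona": "Diego Maradona", "mexico": TEAM,
            "won": "Won (hypothetical page)"}
lp = {"diego maradona": 0.9, "mexico": 0.4, "won": 0.01}
pruned = prune_avg(assigned, lp, toy_rel)
```

In this toy run the two soccer-related annotations support each other and are kept, while "won" has both a negligible link probability and no coherence with the rest, so it is mapped to NA.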

5. EXPERIMENTAL EVALUATION 

In the following subsections we will address some key questions
that pertain to the efficiency and efficacy of Tagme:

• How large is the coverage of Wikipedia's anchors in
short texts like the ones occurring in web-search snippets
and tweets? This is crucial, because we wish to understand
how useful Wikipedia anchors can be for annotation.
(See Sect. 5.1.)

• How effective are the various disambiguation and pruning
phases introduced above? And how do we choose the best
algorithms and parameter settings for Tagme? (See Sect. 5.2.)

• How does Tagme compare against the best-known annotators
[10, 14] on short texts and on long texts? (See Sect. 5.3
and 5.4.)

• How fast is Tagme, and how does its speed compare with
the systems of [10, 14]? (See Sect. 5.5.)

Figure 1: Number of Wikipedia anchors found in
web snippets and tweets.

5.1 Coverage by Wikipedia anchors 

First we want to evaluate the coverage of Wikipedia as a
catalog of senses for the annotation of short texts drawn
from the Web. We consider two types of text fragments:
web snippets and micro-blog posts (namely, tweets), which
constitute a worst-case setting for an annotator because of
their shortness and very poor textual composition. We
derived these datasets by parsing about 5K tweets (of 14 words
each, on average) and about 133K Web snippets (of 30 words
each, on average).^7

Figure 1 reports statistics on the number of Wikipedia
anchors found in those fragments: we point out that more
than 93.9% of tweets and 98.5% of web snippets have at least 3
anchors. So Wikipedia offers an unexpectedly large coverage
of anchors/senses even in these challenging scenarios. We
complemented this quantitative test with a qualitative
evaluation of the "significance" of the presence of an anchor
in a web snippet or a tweet. We used lp(a) as an estimate of
the meaningfulness of a, as suggested in [12, 14], and
computed the distribution of the anchors' link probability
among web snippets and tweets. Results are not reported for
lack of space; however, we note that for at least 95% of the
short texts (both snippets and tweets) the top lp is larger
than 6.5%, which [14] considered a strong indication of a
significant anchor. This percentage remains high, namely
> 90%, when we average the values of the top-5 lps of the
anchors detected in the short text. These results support our
hypothesis that Wikipedia is a significant catalog of senses
also for the short texts drawn from the Web.

^7 Tweets and web snippets were gathered by using "The 1000
most frequent web search queries issued to Yahoo! Search".
We randomly selected 300 queries from this dataset,
performed searches on Twitter and collected the first 20
results. For web snippets, we used almost all queries in that
dataset and collected the top 200 results for each query
on the Yahoo! search engine.

5.2 Setting up TAGME 

For setting up Tagme we used three datasets derived from
Wikipedia, as done in [14]. The first dataset, denoted
Wiki-Disamb30, consists of 1.4M short fragments randomly
selected from Wikipedia pages. Each fragment consists of
about 30 words (like Web snippets). To avoid any advantage
to Tagme, we were careful to select fragments that contain
at least one ambiguous anchor text (i.e. |Pg(a)| > 1).
The second dataset, denoted Wiki-Annot30, consists of 150K
fragments constructed as follows. Since Wikipedia authors
usually link only the first occurrence of an anchor a in a
Wikipedia page z, a short fragment T_z randomly drawn from z
could contain occurrences of a which are un-annotated.
Therefore we expand Wiki-Annot30 by extending the annotation
a ↦ p occurring in z to all occurrences of a in this page
(and thus to the ones in the fragment T_z). After this
expansion, Wiki-Annot30 contains about 1.5M anchor
occurrences, of which 47% are annotated. The third dataset,
denoted Wiki-Long, consists of about 10K randomly selected
Wikipedia articles that contain at least 10 links. This
dataset contains about 270K links in total, and models the
case of highly linked and long texts.
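The expansion step used for Wiki-Annot30 can be sketched as below. The input layout (a list of anchor occurrences in page order, each with its linked page or `None`) and the function name are hypothetical; choosing the first link when an anchor is linked to several pages in the same article is an assumption, since the text does not discuss that case.

```python
from typing import Dict, List, Optional, Tuple

Occurrence = Tuple[str, Optional[str]]  # (anchor text, linked page or None)


def expand_annotations(occurrences: List[Occurrence]) -> List[Occurrence]:
    """Propagate each anchor's link in a page to its un-annotated occurrences."""
    first_link: Dict[str, str] = {}
    for anchor, page in occurrences:
        if page is not None and anchor not in first_link:
            first_link[anchor] = page  # assumption: the first link wins
    return [
        (anchor, page if page is not None else first_link.get(anchor))
        for anchor, page in occurrences
    ]


page_z = [
    ("jaguar", "Jaguar_Cars"),  # linked by the Wikipedia author
    ("engine", None),           # never linked in this page: stays un-annotated
    ("jaguar", None),           # later occurrence: inherits the link
]
print(expand_annotations(page_z))
```

Anchors that are never linked anywhere in the page remain un-annotated, which is consistent with only part of the occurrences being annotated after the expansion.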

To evaluate the performance of the Disambiguation Phase,
we use standard precision and recall scores; whereas for
the overall annotation process (disambiguation+pruning) we
follow [10] and thus focus on the precision P_ann and
recall R_ann measures that are computed on the set of anchors
which are annotated in the ground truth (i.e. the corpora
above). These last measures are very demanding because
they ask for a perfect match between the annotation in the
ground truth and the one obtained by the tested system.
If the goal is to identify topics in the text fragment, then
it does not matter which anchors got annotated but which
senses got linked. So, let G(T) be the senses (pages)
associated to the anchors of T in the ground truth, and let
S(T) be the senses identified by the tested system over T.
As in previous work, we define a topic-based notion of
precision (P_topics) and recall (R_topics) over G(T) and S(T).
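The text does not spell out the formulas for P_topics and R_topics. A minimal sketch for a single text T, assuming the usual set-based definitions (P = |G ∩ S| / |S| and R = |G ∩ S| / |G|), is shown below; the function names and the handling of empty sets are hypothetical, and how the scores are aggregated over a whole dataset is not covered.

```python
from typing import Iterable, Tuple


def topic_precision_recall(gold: Iterable[str],
                           system: Iterable[str]) -> Tuple[float, float]:
    """Set-based precision/recall over senses, ignoring which anchor produced them."""
    g, s = set(gold), set(system)   # G(T) and S(T)
    hits = len(g & s)
    precision = hits / len(s) if s else 0.0  # empty-set convention is an assumption
    recall = hits / len(g) if g else 0.0
    return precision, recall


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0.0


G_T = {"Jaguar_Cars", "Engine", "Formula_One"}
S_T = {"Engine", "Formula_One", "Speed", "Road"}
p, r = topic_precision_recall(G_T, S_T)
print(p, r, f_measure(p, r))
```

Unlike the annotation-based measures, a sense counts as correct here even if it was attached to a different anchor than in the ground truth.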



5.2.1 Setting the disambiguation phase 

We recall that DC and DT are the two approaches to
disambiguation proposed in Sect. 4.2. Here we experiment with
them by splitting Wiki-Disamb30 into two parts: one contains
400K anchors and is used for training, the other contains
the remaining 1M anchors and is used for testing.

Approach DC. We trained a C4.5 classifier, which was shown
to achieve the best results for disambiguation in [14]. We
also tested several values of the parameter τ (which controls
the ex-ante pruning of some pages), and discovered that
τ = 0.5% gives the best results: larger values do not gain
precision, but significantly reduce the recall.

Approach DT. In addition to τ, this approach depends on
a parameter ε which controls the "roughness" in the value
of the voting score. Setting ε close to zero leads DT to
always select the sense that achieves the highest value of
rel_a (i.e. the most related sense, MR for short), while ε
close to 1 leads DT to select the sense with the highest
commonness (i.e. the most common sense, MC for short). If all
senses p_a get rel_a(p_a) = 0, we decided to set a ↦ NA
because it is reasonable to argue that they are unrelated to
the topics of the input text.
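The role of ε can be illustrated with a small sketch. It is only one possible reading of the behavior described above, not the definition given in Sect. 4.2: we assume that DT shortlists the senses whose rel_a score is at least (1 - ε) times the best score and then picks the most common one among them, and that τ discards senses whose commonness is below it. All names and toy values are hypothetical.

```python
from typing import Dict, Optional, Tuple


def disambiguate_dt(candidates: Dict[str, Tuple[float, float]],
                    epsilon: float = 0.30,
                    tau: float = 0.02) -> Optional[str]:
    """candidates maps sense -> (commonness, rel_a score). Returns a sense or None (NA)."""
    # Ex-ante pruning (assumed to act on commonness).
    live = {s: cr for s, cr in candidates.items() if cr[0] >= tau}
    if not live:
        return None
    best_rel = max(rel for _, rel in live.values())
    if best_rel == 0:
        return None  # all senses unrelated to the text: a -> NA
    # Assumed shortlist rule: rel_a within a factor (1 - epsilon) of the best.
    shortlist = [s for s, (_, rel) in live.items() if rel >= (1 - epsilon) * best_rel]
    return max(shortlist, key=lambda s: live[s][0])  # most common among the shortlisted


toy = {
    "Jaguar_Cars": (0.30, 0.90),
    "Jaguar": (0.60, 0.70),
    "Jaguar_(band)": (0.01, 0.95),  # below tau: discarded before voting
}
print(disambiguate_dt(toy, epsilon=0.0), disambiguate_dt(toy, epsilon=1.0))
```

Under this reading, ε = 0 reduces to the most-related choice (MR) and ε = 1 to the most-common choice (MC), matching the two extremes described in the text.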




Figure 2: Performance of DT by varying ε, τ over the
training dataset, where MC and MR denote the choice
of the Most Common (i.e. ε = 100%) and the Most
Related (i.e. ε = 0%) sense, respectively.

Figure 2 plots the F-measure for DT by varying τ and ε.
The performance of DT for values of ε > 50% is not plotted
because it is very close to that of the most-common sense MC.
Overall, lower values of ε are better; experiments led us to
choose τ = 2% and ε = 30%.

Given these parameter settings, we compared DC and DT
over the test set of 1M anchors drawn from Wiki-Disamb30.
Performance is shown in Table 1 below and, although
precision and recall are very close, we decided to use DT in
Tagme for three main reasons: (i) it has a better
F-measure; (ii) the choice of τ = 2% discards many
insignificant pages and thus gains much speed, as explained in
Sect. 4.2; (iii) DT depends on a threshold ε that gives
flexibility: we can increase ε if the input texts are too
ambiguous (and thus choose more often the most-common sense)
or decrease ε if the input texts are more focused (and thus
choose more often the most-related sense).



      Precision   Recall   F-measure
DC    91.7        89.9     90.8
DT    91.5        90.9     91.2

Table 1: Disambiguation performance over Wiki-Disamb30.

5.2.2 Setting the pruning phase 

As detailed in Sect. 4.3, the pruning step hinges on two
features, lp and coherence. We tested two combinations of
these features (AVG and LR) and three different classifiers
deploying these features (C4.5, Bagged C4.5 and SVM). We
trained these classifiers over 50K short texts extracted from
Wiki-Annot30, and then tested them over the remaining
100K short texts. The problem we faced was to generate
both positive and negative training cases from the 50K
texts. We thus proceeded as follows. We ran the
disambiguator DT over those texts and compared its annotations
with the (available) ground truth for them, thus totalling
≈ 10 × 50K = 500K anchor annotations. There are three
cases: if the anchor is linked in the ground truth and the
linked page coincides with the one assigned by DT, then it
is a positive case (with its lp and coherence values); if it is
linked in the ground truth but the linked page differs from
the one assigned by DT, then it is discarded from the
training set; all other cases are considered as negative cases
for the training set (with their lp and coherence values). At
the end, a total of 460K training cases remained. Moreover, in
order to train the three parameters of the approach based
on linear regression (LR), we transformed the boolean class
of the ground truth (linked or not linked) into a numeric
value: we set ρ_LR = 1 for positive cases (i.e. linked anchors)
and ρ_LR = 0 for negative cases.

               P_ann      R_ann      F-measure
Only lp        75.50      72.01      73.71
AVG            76.27      76.08      76.17
LR             76.49      75.74      76.10
C4.5           76.72      76.22      76.47
Bagged C4.5    76.54      76.22      76.38
SVM            76.25      75.96      76.11

               P_topics   R_topics   F-measure
Only lp        76.85      76.65      76.75
AVG            78.41      77.48      77.94
LR             78.42      77.03      77.72
C4.5           76.78      79.69      78.21
Bagged C4.5    79.13      77.12      78.11
SVM            78.91      77.13      78.01

Table 2: Performance of various pruners over Wiki-Annot30,
using annotation and topics metrics.
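The three-way construction of training cases for the pruners in this subsection can be sketched as follows. The function names and the tuple layout are hypothetical, and treating an anchor that is linked in the ground truth but left un-annotated by DT as "differing" (hence discarded) is an assumption.

```python
from typing import Iterable, List, Optional, Tuple


def label_training_case(gold_page: Optional[str],
                        dt_page: Optional[str]) -> Optional[int]:
    """Return 1 (positive), 0 (negative) or None (discarded from training)."""
    if gold_page is None:
        return 0      # anchor not linked in the ground truth: negative case
    if dt_page == gold_page:
        return 1      # linked, and DT agrees with the ground truth: positive case
    return None       # linked, but DT chose another page: discarded


def build_training_set(
    cases: Iterable[Tuple[float, float, Optional[str], Optional[str]]]
) -> List[Tuple[float, float, int]]:
    """cases: (lp, coherence, gold_page, dt_page) -> rows (lp, coherence, target)."""
    rows = []
    for lp, coh, gold_page, dt_page in cases:
        target = label_training_case(gold_page, dt_page)
        if target is not None:
            rows.append((lp, coh, target))
    return rows


cases = [
    (0.40, 0.60, "Jaguar_Cars", "Jaguar_Cars"),  # positive
    (0.30, 0.20, "Jaguar_Cars", "Jaguar"),       # discarded
    (0.05, 0.10, None, "The_(band)"),            # negative
]
print(build_training_set(cases))
```

The numeric targets 1 and 0 are the ones used to fit the LR pruner; the classifiers can use the same rows with the target read as a class label.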

After training, we evaluated our pruners over the remaining
100K fragments of Wiki-Annot30 by varying ρ_NA in
[0, 1] using a step of 0.01 (recall that ρ_NA controls the
sensitivity of our annotation process). In these experiments
we included another simple pruner that we called "Only lp",
which uses only the link probability of the evaluated
anchor to prune the un-meaningful annotations. By comparing
this approach against the others, we can evaluate the
significance of using the feature coherence in addition to the
link probability in the pruning step. We also tested the case
of a coherence-only feature, but performance was worse and is
not reported for space reasons.
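The threshold sweep can be illustrated with a short sketch that tries every value of ρ_NA in [0, 1] with step 0.01 and keeps the one with the highest F-measure. The input layout (one pruning score and one correctness flag per produced annotation, plus the number of ground-truth annotations) and the function name are hypothetical simplifications of the evaluation described above.

```python
from typing import List, Tuple


def sweep_threshold(scored: List[Tuple[float, bool]],
                    n_gold: int,
                    step: float = 0.01) -> Tuple[float, float]:
    """Return (best rho_NA, best F-measure) over the grid 0, step, 2*step, ..., 1."""
    best_threshold, best_f = 0.0, -1.0
    n_steps = int(round(1 / step))
    for i in range(n_steps + 1):
        rho_na = i * step
        kept = [ok for score, ok in scored if score >= rho_na]
        true_pos = sum(kept)
        precision = true_pos / len(kept) if kept else 0.0
        recall = true_pos / n_gold if n_gold else 0.0
        total = precision + recall
        f = 2 * precision * recall / total if total else 0.0
        if f > best_f:  # strict comparison keeps the smallest best threshold
            best_threshold, best_f = rho_na, f
    return best_threshold, best_f


# Toy data: (pruning score, annotation matches the ground truth).
scored = [(0.9, True), (0.7, True), (0.4, False), (0.3, True), (0.1, False)]
print(sweep_threshold(scored, n_gold=4))
```

On the toy data the best trade-off is reached once the lowest-scoring wrong annotation is pruned while all correct ones are still kept.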

Table 2 summarizes all experimental results for the
settings of ρ_NA that yield the highest F-measure, using 2-fold
cross validation. As expected, annotation measures are more
severe than topics measures, although there are
dependencies between them, and indeed the ranking of the
pruners is almost the same in both of them. The overall
performance of all pruning approaches is very close. Results
also show that "Only lp" is surpassed by all other approaches
that also deploy coherence, which confirms the usefulness of
both features in the pruning phase. As a result, and inspired
by Occam's razor, we decided to implement in Tagme the
simplest pruning method, based on ρ_AVG. This is because SVM
is very slow, while the others are as fast as AVG but need a
training step that we prefer to avoid in the Web context.
The final setting for Tagme uses ρ_NA = 0.2; however, the
on-line version of Tagme offers the possibility to modify
this value.

5.3 Comparing annotators on short texts 

We compare the best Tagme setting (namely DT plus
AVG) against Milne&Witten's system, re-built (for
fairness) over the same Wikipedia snapshot used by Tagme.
Since we could not get access to Chakrabarti's system,^8 we
do not include it in this comparison. In any case, we recall
that this system cannot be used in our context because it
is very slow: it takes > 2 secs per 15 anchors [10]. This
is more than one order of magnitude slower than Tagme,
which takes less than 2ms per anchor (see Sect. 5.5
for details). Nevertheless, Chakrabarti's system will be
considered when annotating long texts because of the results
reported in [10]; see Sect. 5.4.



The first experiment compares the disambiguation phase
of Tagme against that of Milne&Witten's system over
the test set of Wiki-Disamb30. To widen the comparison,
we also considered two other (simpler) disambiguators:
one always selects the most-common sense from Pg(a)
(i.e. a statistically driven choice), and the other randomly
selects a page from Pg(a) (i.e. an oblivious choice).





                 Precision   Recall   F-measure
Random           32.2        32.2     32.2
Most Common      85.8        86.8     86.3
Milne&Witten     92.3        84.6     88.3
DT (in Tagme)    91.5        90.9     91.2

Table 3: Performance of various disambiguation algorithms
over the short texts of Wiki-Disamb30.

Results are shown in Table 3. With respect to Milne&Witten,
our disambiguator DT yields a significant improvement in
Recall (+6.3% absolute) and a slight decrease in
Precision (-0.8% absolute), for an overall absolute improvement
in F-measure of about 3%. This is due to our voting scheme,
which deploys the relatedness among the senses associated with
all anchors in the short input text, and thus not only among
the senses of the un-ambiguous anchors (which are possibly
absent in short texts, as commented in Sect. 4.2).
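As a quick arithmetic check, the F-measures of Table 3 can be recomputed from the reported precision and recall, assuming the F-measure is the usual harmonic mean F = 2PR/(P + R). The variable names are hypothetical; the input numbers are those of Table 3.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    return 2 * precision * recall / (precision + recall)


# (Precision, Recall) as reported in Table 3.
table3 = {
    "Milne&Witten": (92.3, 84.6),
    "DT (in Tagme)": (91.5, 90.9),
}

f_mw = f_measure(*table3["Milne&Witten"])
f_dt = f_measure(*table3["DT (in Tagme)"])
print(round(f_mw, 1), round(f_dt, 1), round(f_dt - f_mw, 1))
```

The recomputed values agree with the F-measure column of Table 3 after rounding, and their difference gives the absolute improvement of about 3 points.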



As a final check for fairness, we also compared these
results against the ones presented by Milne&Witten in [14]. In
that paper, the authors evaluated their system over a
collection of 100 full Wikipedia articles, each containing at
least 50 links (for a total of 11K anchors). Their system
yielded an overall F-measure on disambiguation of about
96%, which is larger than the 88% reported in Table 3. The
reason is that our texts are short and more ambiguous, and
thus more difficult to disambiguate: in fact, on our
datasets the choice of a random sense gets F ≈ 32% and the
choice of the most-common sense gets F ≈ 86%; whereas on
Milne&Witten's dataset, these numbers were 53% and 90%
respectively. This further underlines that Tagme's
disambiguation is very effective.

^8 Chakrabarti's personal communication.





                P_ann      R_ann      F-measure
Milne&Witten    69.32      69.52      69.42
Tagme           76.27      76.08      76.17

                P_topics   R_topics   F-measure
Milne&Witten    69.60      69.80      69.70
Tagme           78.41      77.48      77.94

Table 4: Performance of annotators on short texts.



Finally, we compared the overall annotation process in the
two systems by using the test set of Wiki-Annot30. Results
are reported in Table 4, where we notice that Tagme
significantly improves on Milne&Witten's system over short
texts both in precision and recall, by about 7-9% absolute.
The reason is that many features used by Milne&Witten's
system are not effective on short texts: indeed, they consider
features like location and frequency of anchors (which may
be "undefined" or even misleading on short texts), and
they consider only the un-ambiguous anchors to compute
a coherence score (and these are often absent in short texts,
as we commented above).

Figure 3: Performance of annotators on short texts,
by varying the value of ρ_NA.

Since the system by Milne&Witten also offers the
possibility to balance precision vs recall, we analyzed the
impact of ρ_NA on the performance of the two annotators.
Figure 3 reports the comparison, which shows that Tagme
clearly improves on the other approach for almost all values
of ρ_NA.

5.4 Comparing annotators on long texts 

If the input text is long, we do not want to change Tagme's
architecture because we want to obtain software whose
time complexity scales linearly with the number of anchors
in the input text (see below). So we shift a text window
of about 10 anchors over the long input text, and apply
Tagme on each window in an incremental way. It is clear
that this approach gives an advantage to both Chakrabarti's
and Milne&Witten's systems in terms of precision/recall of
the annotation, because they deploy the full input text (and
thus probably more than 10 anchors). Nevertheless, we
decided to stick to this unfavorable setting for Tagme in order
to stress its performance.
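The window shifting can be sketched as below. The text does not specify by how much the window is shifted nor how the annotations of overlapping windows are merged; shifting by one anchor at a time is an assumption, and the function name is hypothetical.

```python
from typing import Iterator, List, Sequence


def sliding_windows(anchors: Sequence[str], w: int = 10) -> Iterator[List[str]]:
    """Yield consecutive windows of at most w anchors, shifted by one anchor."""
    if len(anchors) <= w:
        yield list(anchors)   # short text: a single window covers everything
        return
    for start in range(len(anchors) - w + 1):
        yield list(anchors[start:start + w])


long_text_anchors = [f"a{i}" for i in range(12)]
windows = list(sliding_windows(long_text_anchors, w=10))
print(len(windows), windows[0][0], windows[-1][-1])
```

Each shift introduces exactly one new anchor, which is what makes an incremental update of the scores possible instead of a full recomputation per window.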

Figure 4: Performance of annotators on the long
texts of Wiki-Long.

Figure 5: Performance of Chakrabarti's annotator
and Tagme over the IITB dataset.



Our first experiment is on our dataset Wiki-Long (see
Sect. 5.2), and compares the only two available annotators:
Tagme and Milne&Witten's system. Figure 4 reports
the precision/recall curves as the value of ρ_NA varies.
Surprisingly, Tagme improves on Milne&Witten's system
uniformly when texts have from ten to hundreds of anchors to
be annotated. Given these results we decided to dig into
Wiki-Long in order to evaluate the performance of the two
systems as a function of the number of anchors in the long
input text. For space reasons we cannot plot these results,
but we briefly state that, as expected, as this number grows
the performance of Milne&Witten's system improves (and
approaches an F-measure of about 74%, as stated in [14]
for long and many-linked documents) whereas the
performance of Tagme drops (and approaches 72% from the 78%
achieved on short texts). It goes without saying that Tagme
is designed for short texts, and its parameter settings have
not been re-trained for this long-text case. We plan to
investigate in depth the case of long input texts for Tagme in
the near future, possibly designing a robust variant that
dynamically adapts its settings based on the length of the
input text to be annotated.

Our second experiment is aimed at deriving some
information about Chakrabarti's system. So, we downloaded
their IITB dataset and ran Tagme over it. This way we can
use the performance figures reported in [10] to compare all
three known systems. Figure 5 reports only the performance
of Chakrabarti's and Tagme's systems because, as reported
there, Milne&Witten's system performs so poorly on this
dataset that its recall/precision curve is far to the left and
thus is dropped to ease the reading of the figure. It is
interesting to observe that, even with the severe limitation
imposed by the shift-based approach, Tagme is competitive
with Chakrabarti's system in terms of precision/recall figures,
with the advantage of being more than one order of
magnitude faster (see below).

5.5 On the time efficiency of TAGME 

The most time-consuming step in Tagme's annotation is
the calculation of the relatedness score of Sect. 4.2, because
anchor detection and the other scores require time linear in
the length of the input text T. If n is the number of anchors
detected in T, s is the average number of senses potentially
associated with each anchor, and d_in is the average in-degree
of a Wikipedia page, then the time complexity of the overall
annotation process is O(d_in × (n × s)^2). On our datasets of
short texts it is n ≈ 10, s ≈ 5 and d_in ≈ 50, so that our
current implementation of Tagme takes 1.7ms per anchor
and about 18ms per short text on a commodity PC.^9 This
is more than one order of magnitude faster than the time
performance reported in [10]. Additionally, when Tagme is
applied to long texts of L anchors (possibly L ≫ 10), it
slides and processes a window of about w = 10 anchors over
the input text. This way, Tagme can re-compute the scores
incrementally and thus pay W_cost = O(d_in × w × s^2) time
per window. This is O(L × W_cost) time in total, which is
linear in the number of anchors to be annotated. Conversely,
Chakrabarti's system scales "mildly quadratically" in L, as
stated in [10].
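To make the two bounds concrete, one can plug in the typical values quoted above. The figures below are operation counts up to constant factors, not running times, and the comparison with a full recomputation per window is added here only for illustration:

$$d_{in} \times (n \times s)^2 \approx 50 \times (10 \times 5)^2 = 125{,}000 \quad \text{(one short text)}$$

$$W_{cost} \approx d_{in} \times w \times s^2 = 50 \times 10 \times 5^2 = 12{,}500 \quad \text{(one window shift, incremental)}$$

$$\frac{d_{in} \times (w \times s)^2}{d_{in} \times w \times s^2} = w = 10$$

So, under these values, updating the scores incrementally is about w = 10 times cheaper per window than recomputing them from scratch, and the total cost L × W_cost grows linearly with L.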

Figure 6: Time performance of Tagme.

Figure 6 shows the time performance of Tagme as the
number of anchors in the input text grows.^10 The line
"Total" indicates the average time taken by Tagme for the
complete annotation process (i.e. parsing + disambiguation +
pruning). When the input text has fewer than 10 anchors (i.e.
it is short), the trend is roughly quadratic, as predicted by
the above analysis, with a significant cost induced by the
parsing step. As the number of anchors to be annotated grows,
the time tends to grow linearly (as predicted) with a
parsing cost which becomes almost negligible. We did not
report the time performance of Milne&Witten's system because
it is significantly slower than Tagme: it takes about 95ms per
text on average, and this would have jeopardized the reading
of the figure. Note that Milne&Witten's software
allows one to set up a cache in order to speed up the
annotation process. We tried this setting too, but it incurred
a large internal-memory allocation of about 1.4GB for the
initialization step, and then its Java heap space overflowed
the 2GB available on our PC just after the annotation of a few
thousand short texts (i.e. 6-7K). In any case, the performance
obtained with the help of the cache (which is 18.36ms on
average per short text) is comparable with Tagme's, but Tagme
uses just 200MB of internal memory independently of the
number of processed texts.

^9 In detail, anchor parsing takes about 6.2ms, and
disambiguation and pruning about 12.5ms per input text. Of
course, code engineering may speed up Tagme, and this will be
addressed in the future.

^10 Text fragments are drawn from Wiki-Annot30.

6. CONCLUSION AND FUTURE WORKS 

In the light of the experiments conducted on
Wikipedia-based datasets, one could reasonably ask: does Tagme
achieve the same effective performance in the wild? There are
two issues that let us argue positively about this: (1) the
IITB dataset is a manually annotated set of news stories drawn
from the Web, and there Tagme is superior in either
precision/recall or speed to the state-of-the-art systems (see
Sect. 5.4); (2) the user study conducted in [14] confirmed that
performance figures obtained over large datasets drawn from
Wikipedia are good predictors of annotation performance in the
wild, and indeed our datasets were larger and more varied than
the ones used in that paper. In addition to these two
positive pieces of evidence, we are currently setting up a much
larger user study over Mechanical Turk with the twofold goal of
creating a manually annotated dataset much larger than the
one offered by [10] and extending the tests of Tagme.

We believe that Tagme, like the systems of [10, 14], has
implications which go far beyond the enrichment of a text
with explanatory links. We are currently investigating the
impact of Tagme's annotation on the performance of our
past system SnakeT [i] for the on-the-fly labeled
clustering of search-engine results (see also Clusty.com or
Carrot). In fact SnakeT, like most of its competitors (see
e.g. pj), is based only on syntactic and statistical
features and thus it could benefit from Tagme's annotation
to improve the effectiveness of the labeling and the
clustering phases. Furthermore, we are studying how other
by-products of Wikipedia, such as DBpedia.org,
Freebase.com, Kylin [Is] or YAGO [16], could be used in Tagme
to better relate and/or assign senses to text anchors.

Finally, we plan to investigate the application of Tagme in
Web Advertising: the explanatory links and the structured
knowledge attached to plain texts could allow the efficient
and effective resolution of ambiguity and polysemy issues
which often occur when advertisers' keywords are matched
against the content of Web pages offering display ads.